Build your own recommendation system for products on an e-commerce website like Amazon.com. Online e-commerce websites such as Amazon and Flipkart use different recommendation models to provide suggestions tailored to each user.
Amazon currently uses item-to-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time. This type of filtering matches each of the user's purchased and rated items to similar items, then combines those similar items into a recommendation list for the user. In this project we will build a recommendation model for Amazon's electronics products. The dataset is taken from the website below. Source - Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository contains several datasets; for this case study we use the Electronics dataset.
Dataset columns - the first three columns are userId, productId, and ratings; the fourth column is timestamp. You can discard the timestamp column, as it is not needed for this case study.
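Before diving into the data, here is a minimal sketch of the idea behind item-to-item collaborative filtering described above: represent each product as the vector of ratings it received across users, and score product pairs by cosine similarity. The matrix and all names (`P1`, `U1`, etc.) are toy illustrations, not the Amazon dataset.

```python
import numpy as np
import pandas as pd

# Toy user-item ratings matrix (rows: users, columns: products).
# A 0 means the user has not rated that product. Values are illustrative.
ratings = pd.DataFrame(
    {"P1": [5, 4, 0, 1],
     "P2": [4, 5, 0, 0],
     "P3": [0, 0, 5, 4]},
    index=["U1", "U2", "U3", "U4"],
)

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Item-to-item similarity matrix: compare every pair of product columns.
items = ratings.columns
sim = pd.DataFrame(
    [[cosine_sim(ratings[i].to_numpy(), ratings[j].to_numpy()) for j in items]
     for i in items],
    index=items, columns=items,
)
print(sim.round(2))
```

To recommend for a user, you would look up the items they rated highly and surface the most similar unrated items from `sim`. Production systems precompute this similarity table offline, which is what makes the approach fast at serving time.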
@Misc{Surprise,
author = {Hug, Nicolas},
title = { {S}urprise, a {P}ython library for recommender systems},
howpublished = {\url{http://surpriselib.com}},
year = {2017}
}
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
from surprise import SVD
from surprise.model_selection import cross_validate
%matplotlib inline
df = pd.read_csv('ratings_Electronics.csv',names=['userId','productId','ratings','timestamp'])
df.head()
df = df.drop('timestamp', axis=1)
df.info()
df.describe()
print("Number of NaN values in our dataframe:", df.isnull().sum().sum())
dup_bool = df.duplicated(['userId','productId','ratings'])
dups = dup_bool.sum()  # considering all remaining columns (timestamp was dropped)
print("There are {} duplicate rating entries in the data.".format(dups))
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",df.shape[0])
print("Total No of Users :", len(np.unique(df.userId)))
print("Total No of Products :", len(np.unique(df.productId)))
p = df.groupby('ratings')['ratings'].agg(['count'])
# get product count
prod_count = df['productId'].nunique()
# get user count
user_count = df['userId'].nunique()
# get rating count
rating_count = df.shape[0]
ax = p.plot(kind = 'barh', legend = False, figsize = (15,10))
plt.title('Total pool: {:,} Products, {:,} users, {:,} ratings given'.format(prod_count, user_count, rating_count), fontsize=20)
plt.axis('off')
for i in range(1, 6):
    ax.text(p.iloc[i-1, 0] / 4, i - 1,
            'Rating {}: {:.0f}%'.format(i, p.iloc[i-1, 0] * 100 / p['count'].sum()),
            color='white', weight='bold')
From this, we can see that 56% of all ratings in the data are 5, and very few ratings are 2 or 3. Most people rate a product only when it is really bad (rating 1: 12%) or very good (ratings 4 and 5); average ratings (2 or 3) are rarely given. A product with many low ratings is generally a genuinely bad product.
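Percentages like these can be cross-checked directly with `value_counts(normalize=True)`. A minimal sketch on a toy ratings column (values illustrative; substitute the real `df['ratings']`):

```python
import pandas as pd

# Toy stand-in for the ratings column; values are illustrative only.
toy = pd.DataFrame({"ratings": [5, 5, 5, 4, 1, 5, 3, 5, 4, 2]})

# Fraction of each rating value, sorted by rating.
shares = toy["ratings"].value_counts(normalize=True).sort_index()
print((shares * 100).round(1))
```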
no_of_rated_products_per_user = df.groupby(by='userId')['ratings'].count().sort_values(ascending=False)
no_of_rated_products_per_user.head()
# Number of ratings per user
data = df.groupby('userId')['ratings'].count().clip(upper=50)
# Create trace
trace = go.Histogram(x=data.values,
                     name='ratings',
                     xbins=dict(start=0,
                                end=50,
                                size=2))
# Create layout
layout = go.Layout(title='Distribution of Number of Ratings Per User (Clipped at 50)',
                   xaxis=dict(title='Ratings Per User'),
                   yaxis=dict(title='Count'),
                   bargap=0.2)
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
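The `SVD` and `cross_validate` imports above come from the Surprise library cited earlier. As a hedged illustration of the matrix-factorization idea behind `surprise.SVD`, here is a minimal NumPy sketch: mean-center the observed ratings, take a low-rank truncated SVD, and read predicted scores off the reconstruction. The matrix is a toy example, not the Amazon data, and Surprise itself fits the factors by gradient descent on observed entries rather than by a plain SVD.

```python
import numpy as np

# Toy user-item matrix (0 = missing rating); values are illustrative.
R = np.array([[5., 4., 0., 1.],
              [4., 5., 0., 0.],
              [0., 0., 5., 4.],
              [1., 0., 4., 5.]])

# Mean-center the observed entries, leaving missing entries at zero.
mask = R > 0
mean = R[mask].mean()
X = np.where(mask, R - mean, 0.0)

# Rank-2 truncated SVD gives a low-rank reconstruction of the matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
R_hat = mean + (U[:, :k] * s[:k]) @ Vt[:k, :]

# Predicted score for user 0 on the item they have not rated (column 2):
print(round(R_hat[0, 2], 2))
```

With the real data you would instead wrap `df` in a Surprise `Dataset` and run `cross_validate(SVD(), data)` to get RMSE/MAE estimates, which is what the imports at the top set up.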